1 Introduction

We will pick up where we left off in Demo 2 to prepare you for analysis. Although this will not be as exhaustive as a complete project would demand, our goal is to see this project go from messy data \(\rightarrow\) prediction & interpretation. After all, we need to give the patient advocacy watchdog something they can deploy!

1.1 Our Analysis Plan

Our goal is to see this analysis through to generate a predictive model that the advocacy group can use, and hopefully it will be a decent one. To ensure that it stands the best chance of being a good predictive model, our analysis plan includes all the key steps in the data science life cycle/project flow, including: cleaning, exploration, feature selection, pre-processing, unsupervised machine learning exploration, and supervised modeling. Whew!

Our specific plan is to perform two big analyses in the hopes of having a nice deliverable to present to our client. The first will be a segmentation analysis, or clustering analysis, which will attempt to group our hospitals together on the basis of their attributes and their pneumonia-related readmissions. This will be done using a combination of PCA and \(k\)-means. Our second analysis will be to perform supervised, predictive analysis using an elastic net, which conveniently combines a penalized regression with variable selection so we can point our clients down a path of hospital attributes to focus on, as well as a model they can use to make future predictions.

1.2 Your objective.

Your objective is to work through this Project 2. You will have 31 Questions to answer as you work through it, just as with Project 1. Then, you will be able to choose your own adventure again to perform your own clinical informatics analysis:

Adventure 1. [Minimal coding, focus is on understanding cleaning, pre-processing, and the ML analyses we performed here.]

Choose ONE other condition (anything other than pneumonia) and run through the analyses again, using our same target variable. You will update your question, hypotheses, and predictions, but will be able to re-use all of the other code with very minor modifications. Credit will be based more on interpretation and not on coding.

Adventure 2. [More coding, but not as much, with a focus on the steps we’ve undertaken here and extending it to a new question & condition.]

Choose a slightly more complicated analysis to undertake, for example, focusing on surgical interventions (HIP-KNEE & CABG) or heart-related conditions (HF, AMI, & CABG). Coding should still be fairly minimal, but you are likely to run into problems with the code working exactly as-is, especially during cleaning. The advantage of this analysis is that it will be much more robust, have larger sample sizes, and would allow us to be able to say something far more informative to a class of patients (e.g., those considering undergoing surgical interventions).

Adventure 3. [Most coding, but no cleaning or pre-processing, with an emphasis entirely on the machine learning analyses.]

Continue with the pneumonia-related data that you have here, completely cleaned and pre-processed, and do two more unsupervised OR supervised analyses. For example, you could do another clustering method (e.g., kNN) and compare that to the clusters we will get in our analysis, or you could do a random forest like we did in the last module, or you could choose to do a Ridge and LASSO regression and compare those analyses to what we will get out of an Elastic Net. Any of these are fine, but you must explain your decision in your submission.

1.3 Code to help.

This time, I am doing things a little differently from Project 1 to show you something new. Here, we are using the source() function to effectively use R like an object-oriented language. We can write code to do more tedious tasks, or even make our own R package(!) [not shown here] to deal with common or tedious tasks we do. The linked answers_demo2.R file contains all of my code for the cleaning tasks from Demo 2, including my answers to the big functions I asked you to write in Questions 12 & 15.

For Project 2, you will have the option to modify the code in the answers_demo2.R file and run it in your markdown file, as I do below. Now, you don’t have to run the code right now unless you want to see it in action; if you do, make sure to give it a couple of minutes to run, as it’s doing a lot in one fell swoop. Or, you can just skip ahead to the next section and load the data.
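If you do choose to run it, the call itself is short; here is a sketch, assuming answers_demo2.R sits in your working directory:

```r
## Run all of the Demo 2 cleaning code in this session.
## echo = TRUE prints each command as it runs, so you can follow along.
source("answers_demo2.R", echo = TRUE)
```

The echo argument is optional, but it is handy when you want to watch a long script work rather than stare at a blank console.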

1.4 Load data from the end of the demo.

Load my version of the fully cleaned, encoded, & ready-to-proceed dataset from the end of Demo 2. These data are not yet ready for analysis (!) but we are picking up with that here.

load("pneumoniaAnalyzefromDemo.Rdata")

2 Exploratory Data Analysis (Continued)

2.1 My solution: Which of the possible target (outcome) variables should we use?

My answer to Question 18 from the Demo.

Now, I know I didn’t ask you to do it, but I will actually begin my assessment of which target is most appropriate by constructing a quick summary table. In my assessment of what is an appropriate target, I am including ComparedToNational_Hospital return days for pneumonia patients because this is our best assessment of how the hospital is performing relative to a national average. If we were interested in this as the target, we would need to perform a multinomial classification analysis - totally doable, but perhaps beyond our scope right now. :) So, instead, let’s use this category to help us contextualize our hospitals as we choose between Predicted, Observed, and Expected readmission rates, as well as the Excess Readmission Ratio.

I am also going to quickly break down the differences between the possible targets so we can make sure we really understand what they are measuring to allow us to make our most-informed selection of a target. Note: More information was found in this PDF.

Table 1. List of possible target variables from the pneumonia-related hospital readmission data. Which one to choose?
Possible Target: Description
ComparedToNational_Hospital return days for pneumonia patients: Hospital return days measures the number of days patients spent back in the hospital (in the emergency department, under observation, or in an inpatient unit) within 30 days after they were first treated and released for pneumonia. Reported as compared to the national average, such that ‘below average’ is better and ‘above average’ is worse.
ExpectedReadmissionRate: The expected number of readmissions in each hospital is estimated using its patient mix and an average hospital-specific intercept. It is thus indirectly standardized to other hospitals with similar case and patient mixes.
PredictedReadmissionRate: The number of readmissions within 30 days predicted based on the hospital’s performance with its observed case mix. The predicted number of readmissions is estimated using a hospital-specific intercept, and is intended to reflect the annual expected performance of the hospital given its historical case and patient mix and performance.
ExcessReadmissionRatio: The ratio of the predicted readmission rate to the expected readmission rate, based on an average hospital with similar patients. Performance is compared against a ratio of one, such that below one is better and above one is worse in terms of readmission rates.
observed_readmission_rate: Our calculation of the observed number of pneumonia-related readmissions within 30 days, found by dividing the number of readmissions observed for the hospital during the given period by the number of discharges for that same period, and multiplied by 100 to put it on the same scale as the Predicted and Expected Readmission Rates. It reflects a true observation as reported by the hospital during that period, but is not adjusted for case mixes or prior information for that hospital. Thus, this is a crude, unadjusted value.
Table 2. Possible target measure summaries
Characteristic N = 4,816 (1)
National Comparison Return Hospital Days
    Better than average 402 / 4,816 (8.3%)
    Same as average 2,285 / 4,816 (47%)
    Worse than average 827 / 4,816 (17%)
    Unknown 1,302 / 4,816 (27%)
Predicted Readmission Rate 16.46 (1.93), 16.27
    (Missing) 2,090
Observed Readmission Rate 17.1 (3.7), 16.9
    (Missing) 2,603
ExcessReadmissionRatio 1.00 (0.06), 1.00
    (Missing) 2,090
Expected Readmission Rate 16.43 (1.51), 16.39
    (Missing) 2,090
(1) n / N (%); Mean (SD), Median

We can see right away that we lose nearly 600 more data points if we opt to use our calculated observed_readmission_rate, which is a by-product of the fact that the Predicted and Expected readmission rates use prior information and information about patient mixes to generate their values. Our observed_readmission_rate is crude and unadjusted in any way, which could present problems when it comes to true comparability.

2.2 My answer to Question 18 from the Demo.

If I know I want to facet plots, the first thing I have to do is reshape the data I want from wide into long format using, for example, the pivot_longer() function (the complement of pivot_wider() that we used in the Demo).

I don’t have to store it as a separate data frame (i.e., I can just pipe it directly into ggplot()!) but here I am choosing to store it so we can more easily make sure it’s reshaped correctly along the way. Thus, I am making a long-format version of the possible target variables called longDataTarget so that I can better decide which target to choose.
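As a sketch of that reshaping step (the exact column selection here is an assumption based on the targets listed in Table 1):

```r
library(dplyr)
library(tidyr)

## Reshape the candidate target columns from wide to long format so
## they can be faceted in ggplot(); one row per (hospital, target) pair.
longDataTarget <- pneumoniaAnalyzeDemo %>%
  select(PredictedReadmissionRate, ExpectedReadmissionRate,
         ExcessReadmissionRatio, observed_readmission_rate) %>%
  pivot_longer(cols = everything(),
               names_to = "PossibleTarget",
               values_to = "Rate")
```

Once the data are long, the PossibleTarget column becomes the faceting variable in facet_wrap().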

I mentioned violin and density plots specifically in Question 18 of Demo 2, as well as faceting. I am going to show you two ways to explore these data with these types of plots and you can decide which one(s) you find most informative here. You also may have thought of a better way to display the data too!

Example with geom_violin():

Here I am leaving the NA data (which I relabelled as Unknown) to get a sense for the degree of missingness.

Example with geom_density():

However, this time I am going to ignore the “Unknown” (i.e., missing) class of data.

2.5 Final remarks on target selection.

I am sure you have noticed by now that, even though I made kind of a big deal about how well our crude measure seemed to reflect the number of hospital days, I ultimately seem to have chosen the Predicted Readmission Rate. In a full analysis, we might consider further exploring a best choice - or even circling back to our stakeholder and trying to refine our question - as we try to choose a target. The one thing that becomes clear, I hope, from these large, national studies is that using the adjusted values will generally be a better choice than the crude ones, even if our crude measure captures something interesting about the data, as it did here. Our crude measure, observed_readmission_rate, did the best job of recapitulating the Number of Hospital Days, but if that’s really our focus, wouldn’t it then be better to switch from a readmission rate to the number of hospital days?

If you still feel unsure about the best choice, you’re not alone - so do I! This is where, because we are working with an imagined stakeholder, we must make the best decision we can given the question we defined in the Demo. I would defend my choice of Predicted Readmission Rate for three key reasons:

  • It has been adjusted to be hospital-specific, using a hospital-specific intercept and historic data about the hospitals’ performances
  • We don’t lose nearly 600 data points to missingness.
  • Our imagined stakeholder, and the question we set, was about readmission rate so even though return hospital days might also be a similarly viable candidate, it doesn’t reflect our imagined stakeholder’s request. However, the ExpectedReadmissionRate is standardized for similar hospitals, and may not truly reflect a given hospital; the ExcessReadmissionRatio, while informative, was the least likely to reflect the distinctiveness in hospital return days, which is still of interest to us.

But remember that what matters most is that we are matching our question.

Bias and Limitation

Question 3: [1 point]

All studies have bias (introduced or systematic error) and limitations (failure to fully explain something). We could go more deeply into these concepts, but for now I want you to take a stab at brainstorming either ways we might have bias OR limitations of using this target variable. You do not need to answer both, but you’re welcome to do so. You can read more about bias here or see some deeper explanations of limitations here. You do not need to write a lot; just 2-3 sentences. I am simply asking you to pause and reflect on our choices here before we proceed.

Your answer here.

3 Pre-processing and Feature Selection

The time has finally come to officially move back into pre-processing and feature selection so that we can finally build some models. I want you to take note of the order in which I’m doing things here; the flow can change from project to project, but generally it’s a good idea to get your dataset to a state of manually curated features before doing imputation, and certainly before you make your train-test split. Another big thing to notice is that any scaling would usually be done AFTER a train-test split, with the exception of unusual situations like our gene expression analysis in Weeks 1 & 2.

3.1 Further feature selection and refinement.

3.1.1 Dropping the targets we aren’t planning to use. [Manual Method]

It’s time to say goodbye to the features we know would be far too redundant or non-informative given our target of PredictedReadmissionRate. We don’t need any of the Sample_ columns, for example; these are merely sample sizes. They would be useful if we were doing standardization, but are not necessary for us right now. We are also dropping the other possible target features that are too redundant with PredictedReadmissionRate.

pneumoniaAnalyzeDemo <- pneumoniaAnalyzeDemo %>% 
  select(-ExpectedReadmissionRate, 
         -ExcessReadmissionRatio,
         -observed_readmission_rate, 
         -contains(c("Sample_", "NumberOfPatients")),
         -NumberSurveysCompleted) %>% 
  rename(`Median time (minutes) patients spent in ED` = `Score_Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better`,
         `Composite patient safety` = `Score_CMS Medicare PSI 90: Patient safety and adverse events composite`,
         `ComparedToNational_Composite patient safety` = `ComparedToNational_CMS Medicare PSI 90: Patient safety and adverse events composite`,
         `Healthcare workers given 1st COVID-19 vaccine` = `Score_Percentage of healthcare personnel who completed COVID-19 primary vaccination series`)

## Clean up the column names quickly - remove "Score_", "HcahpsLinearMeanValue_", "_Payment"
colnames(pneumoniaAnalyzeDemo) <- gsub('Score_|HcahpsLinearMeanValue_|_Payment', '', colnames(pneumoniaAnalyzeDemo))
Question 4: [1 point]

Why do you think I chose to keep Hospital return days for pneumonia patients and ComparedToNational_Hospital return days for pneumonia patients? Do you agree with my choice? Why or why not? What analytic pitfalls are there to including them?

Your answer here.

3.1.2 Dropping any features with near-zero variance. [Automated Method]

Next, we are going to take advantage of the nearZeroVar() function from the caret package. We will drop any of the columns with near-zero variance. This is a way of automating (and thus standardizing) the process of choosing to drop columns that are non-informative due to zero or nearly zero variance.
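The automated step itself is only a few lines; here is a sketch of how it can be applied (default thresholds assumed):

```r
library(caret)

## Compute variance diagnostics for every column; saveMetrics = TRUE
## returns a data frame with a logical `nzv` flag per column.
nzv <- nearZeroVar(pneumoniaAnalyzeDemo, saveMetrics = TRUE)

## Inspect which columns were flagged as near-zero variance...
rownames(nzv)[nzv$nzv]

## ...and drop them from the dataframe.
pneumoniaAnalyzeDemo <- pneumoniaAnalyzeDemo[, !nzv$nzv]
```

The freqRatio and percentUnique columns of the metrics object show why each flagged column was considered near-zero variance, which is handy for reporting decisions back to a stakeholder.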

Table 3. Columns dropped due to near-zero variance.
Variable
PaymentCategory for pneumonia patients
ComparedToNational_Postoperative respiratory failure rate
ComparedToNational_Perioperative pulmonary embolism or deep vein thrombosis rate

We can see that we lost three more features due to near-zero variance, bringing our total number of features down to 33. We’re getting closer to something manageable! You also may not have caught this yet, but I had moved FacilityID to our rownames – this just gets that out of the way too.

Question 5: [1 point]

If you are not familiar yet with the concept of zero or near-zero variance, do a little digging. Based on what you read, try to explain for your naive advocacy group stakeholders, why it’s important to remove the features with little variance.

Your answer here.

Question 6: [1 point]

Why do I want to make sure unique identifiers, like FacilityID, are out of the dataframe before I either run nearZeroVar() or proceed with the rest of the analysis? What problems does keeping them in there create?

Your answer here.

3.2 Assess missingness & devise an imputation plan

We are going to deal with imputation in a rather quick and elegant way that requires minimal effort from us. Why? We aren’t focusing much on imputation this time, even though we could, and our cursory treatment isn’t meant to suggest that it isn’t important! Instead, I want us to focus on the remainder of the analysis as we work through the data science analytical steps on the hospital readmissions data. Thus, I absolutely encourage you to look more deeply into imputation effects and options, but it’s beyond our scope during this project.

We are going to leverage the packages mice and missForest for assessing and dealing with missingness. I am sure you have already noticed how much of the dataset is missing. Ouch! Now, we do not want to impute anything with our target, PredictedReadmissionRate, so we will want to make sure to drop that before imputation. However, we can take a look at it with regard to the remainder of our variables to assess the extent of the missingness problem.

Use the md.pairs() function from the mice package to assess the extent of missing values before imputing.

I am dropping FacilityID and State from the matrix, as they are not missing and can actually create errors. At this stage I am keeping PredictedReadmissionRate in the matrix so we can see which other variables have missingness most correlated with its own. Why? This will let us know where we are going to lose other information, so we can make informed decisions about whether we want to keep those features going forward or not. What we’re looking at here is, for each pair of columns \((Y_j, Y_k)\), the proportion of rows where one variable is missing that are missing versus observed in the other.

# Drop ID and State and analyze the missingness in pairs using the mice package
p <- md.pairs(pneumoniaAnalyzeDemo[,-c(1:2)])
## Pairwise missingness pattern: for each pair, among rows where the first variable is missing, the proportion where the second variable is observed.
df <- as.matrix(p$mr/(p$mr+p$mm))
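One simple way to visualize that matrix is with base R’s heatmap() (a sketch; it assumes the `df` matrix computed just above):

```r
## Plot the pairwise missingness proportions as a heatmap.
## Rowv/Colv = NA suppress the dendrogram reordering so the rows and
## columns stay in their original order; scale = "none" keeps the raw
## proportions rather than rescaling each row.
heatmap(df, Rowv = NA, Colv = NA, scale = "none")
```

Any heatmap function (e.g., from pheatmap or ggplot2) would work just as well here.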

Darker colors mean higher degrees of correlated missingness. So, for example, we see a pretty high rate of missingness in the percent of healthcare workers correlated with most other variables in the dataset.

Question 7: [1 point]

What columns seem to be most correlated in terms of missingness with our target, PredictedReadmissionRate? Does this give you any cause for concern?

Your answer here.

Drop the rows that are missing from the target variable.

We will not be able to use those rows at all moving forward, sadly. That is going to reduce our sample size pretty sizably, from \(N = 4,816\) to only \(N = 2,726\).

## [1] "Rows: 2726"  "Columns: 33"

At this stage, we are ready to impute - but before we do that, we actually ALWAYS want to do our data partitioning first! (Same is true for centering and scaling, which we will do after imputation).

3.3 Split into training & testing sets.

You may recall that I mentioned that the optimal ratio has been argued to be \(\sqrt{p} : 1\), where \(p\) is the number of parameters (which may or may not equal your predictors depending on your planned analysis). Note that you do not have to split only a single time; you could make different splits for different analyses, if needed. However, we are going to use a single split here for convenience.

Now, wouldn’t it be nice to have a handy function that could calculate the optimal split ratio for us? So, let’s write one! This is something you can use with ANY analysis going forward!

Question 8: [2 points]

There are three parts to this question to get it all to come together correctly. Practice reading R code:

  • Comment the function to explain what it does (I’ve defined the arguments for you)
  • Run the function (fill in the blank)
  • Fill in the missing piece in the partitioning chunk below
calcSplitRatio <- function(p = NA, df) {
  ## @p  = the number of parameters. by default, if none are provided, the number of columns (predictors) in the dataset are used
  ## @df = the dataframe that will be used for the analysis
  
  ## COMMENT HERE
  if(is.na(p)) {
    p <- ncol(df) -1   ## COMMENT HERE
  }
  
  ## COMMENT HERE
  test_N <- (1/sqrt(p))*nrow(df)
  ## COMMENT HERE
  test_prop <- round((1/sqrt(p))*nrow(df)/nrow(df), 2)
  ## COMMENT HERE
  train_prop <- 1-test_prop
  
  ## COMMENT HERE
  print(paste0("The ideal split ratio is ", train_prop, ":", test_prop, " training:testing"))
  
  ## COMMENT HERE
  return(train_prop)
}

## Fill in the blanks to run:
# train_prop <- ___(df = ___)

Hint: One important note here is that if the number of parameters is not provided (which is true for the starter code I gave you to run it - it does NOT pass any parameters to that argument!), then by default our function is designed to take the number of columns of the dataframe and subtract one for the target variable. Neat, huh?

You should have gotten an 83-17 split. Notice how that’s actually very close to the canonical 80-20 split!
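As a quick sanity check on that arithmetic in base R (assuming \(p = 33\) parameters):

```r
## Verify the sqrt(p):1 split arithmetic by hand, assuming p = 33.
p <- 33
test_prop  <- round(1 / sqrt(p), 2)   # proportion going to the test set
train_prop <- 1 - test_prop           # proportion going to the training set
c(train = train_prop, test = test_prop)
```

With \(p = 33\), \(1/\sqrt{p} \approx 0.17\), which is where the 83-17 split comes from.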

# ind <- createDataPartition(pneumoniaAnalyzeDemo$___,                  ## Remember to put the target here!
#                            p = ___,                                   ## Remember to use the object you just made from your function!
#                            list = FALSE)
# 
# train <- pneumoniaAnalyzeDemo[___, ]
# test <- pneumoniaAnalyzeDemo[-c(ind), ]

Load the test and train data in case you struggled with Question 8 so you can proceed:

load(file = "pneumoniaTrain.Rdata")
load(file = "pneumoniaTest.Rdata")

3.4 Impute missing variables using missForest.

We are going to take advantage of the missForest() function from the missForest package. In a nutshell, missForest fits either a regression- or classification-based random forest to each variable, uses the fitted forests to predict and fill in the NAs, and reports out-of-bag (OOB) error estimates of the imputation quality. The advantage is that this process is non-linear, done in just a few lines of code, and can handle mixed data types (although you need to make sure that all your numeric columns are truly numeric, and that all your categorical columns are characters or factors).

The way I have chosen to do this is to separate the data so that I can turn the encoded categories back into categories temporarily, then stitch them back together into a temporary dataframe so that I can run missForest. Notice that I temporarily drop the State and target features; this is to spare ourselves a little computational time. Be patient. As this is randomForest, it could take a minute or two to run.
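A minimal sketch of that imputation call (the exact column handling is an assumption; it presumes `train` already has its factor columns typed correctly, and the seed is my own addition since missForest is stochastic):

```r
library(missForest)
library(dplyr)

## Temporarily drop State and the target to spare some computation,
## then impute everything else with a per-variable random forest.
toImpute <- train %>%
  select(-State, -PredictedReadmissionRate) %>%
  as.data.frame()

set.seed(42)  # assumption: fix the RNG so the imputation is reproducible
imputedTrain <- missForest(toImpute, verbose = FALSE)

## The imputed dataframe lives in imputedTrain$ximp;
## the OOB error estimates live in imputedTrain$OOBerror.
```

Expect this to take a minute or two on a dataset of this size; random forests are not cheap.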

Let’s look at a quick summary of the results.

Table 4. missForest OOB Error Rates for the imputed variables, training dataset
OOB Error Variable
MSE
0.29 MRSA Bacteremia
1788880.42 Payment for pneumonia patients
18.73 Abdomen CT Use of Contrast Material
3.90 Death rate for pneumonia patients
7.03 Postoperative respiratory failure rate
0.45 Perioperative pulmonary embolism or deep vein thrombosis rate
0.01 Composite patient safety
0.00 Medicare spending per patient
83.97 Healthcare workers given 1st COVID-19 vaccine
262.51 Healthcare workers given influenza vaccination
1447.88 Median time (minutes) patients spent in ED
4.07 Left before being seen
70.06 Venous Thromboembolism Prophylaxis
24.94 Intensive Care Unit Venous Thromboembolism Prophylaxis
207.45 Hospital return days for pneumonia patients
0.97 Nurse communication
1.83 Doctor communication
4.47 Staff responsiveness
4.42 Communication about medicines
4.69 Discharge information
1.13 Care transition
11.84 Cleanliness
13.62 Quietness
0.96 Overall hospital rating
2.55 Recommend hospital
0.00 ExcessReadmissionRatio
PFC
0.00 SurveyResponseRate
0.49 Emergency department volume
0.13 ComparedToNational_MRSA Bacteremia
0.08 ComparedToNational_Death rate for pneumonia patients
0.06 ComparedToNational_Composite patient safety
0.09 ComparedToNational_Hospital return days for pneumonia patients
Question 9: [1 point]

What do MSE and PFC indicate here? You may find looking at the missForest vignette helpful. How is the random forest that was performed here similar to what we performed on the gene expression data? How does it differ?

Your answer here.

We are almost done. The last thing we need to do is extract the imputed values from imputedTrain$ximp, and add the State and target variables back on. Then, we need to turn the factor features back into numbers for our next analyses.

Notice a bit of annoying trickery here. Because we had converted them to a factor type (as required by missForest), in order to back-convert them correctly to our original ordinal encoding, we had to first convert to character and then to numeric. So, our conversion went factor \(\rightarrow\) character \(\rightarrow\) numeric in order to get exactly what we wanted!
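A tiny base-R example of why the direct factor \(\rightarrow\) numeric conversion goes wrong:

```r
f <- factor(c(5, 10, 2))

## Converting a factor straight to numeric returns the internal
## level indices, not the original values:
as.numeric(f)                  # 2 3 1  -- not what we want!

## Going through character first recovers the original encoding:
as.numeric(as.character(f))    # 5 10 2
```

The factor stores its values as integer codes into a sorted level set, which is why the character detour is required to get back our original ordinal encoding.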

imputedTrain <- imputedTrain$ximp %>% 
  ## Put the original State and target variables back on
  mutate(State = train$State,
         PredictedReadmissionRate = train$PredictedReadmissionRate) %>% 
  ## Convert the factors back to character type...
  mutate_at(c("Emergency department volume",
              "ComparedToNational_MRSA Bacteremia",
              "ComparedToNational_Composite patient safety",
              "ComparedToNational_Death rate for pneumonia patients",
              "ComparedToNational_Hospital return days for pneumonia patients"),
            as.character) %>% 
  ## ... so we can then convert them back to numeric type.
  mutate_at(c("Emergency department volume",
              "ComparedToNational_MRSA Bacteremia",
              "ComparedToNational_Composite patient safety",
              "ComparedToNational_Death rate for pneumonia patients",
              "ComparedToNational_Hospital return days for pneumonia patients"),
            as.numeric)

## Save data files so they can be loaded for future use 
# save(imputedTrain, file = "imputedTrain.Rdata")
Question 10: [1 point]

Now repeat the imputation steps for the test dataset, storing it into imputedTest.

# > Your code here

If you need help, load these data to proceed, skip the imputation, and/or save time.

Question 11: [1 point]

Quickly explore how well the imputation did by choosing at least one numeric variable and one of the categorical variables. Did the distributions or frequencies change drastically?

# Your code here.

Your answer here.

Struggling? Here’s my solution:

# source(file = "answer_question11.R", echo = TRUE)

3.5 Transformation & Scaling

Ultimately, we know we will need to scale & center our data, as that is required for the analyses I have set out in our Analysis Plan. But before we get into that, it would be nice to know what realm of analyses we fall into and, more importantly, whether we need to consider transformation of our target in addition to the scaling & centering we plan to do on the entire dataset.

In other words: is our target variable approximately normally distributed?

Question 12: [1 point]

Why is a histogram, or even a Q-Q plot, an insufficient way to assess whether a distribution is approximately normal?

Your answer here.

Question 13: [1 point]

Apply a Shapiro-Wilk test for normality using the shapiro.test() function. What does it indicate about your distribution? (Note: Shapiro tests are only reliable for \(N < 5000\), but it is fine to perform here.)

# Your code here

Your answer here.

3.5.1 Box-Cox Transformation

Recall from our brief discussion in lecture that the Box-Cox transformation involves the parameter \(\lambda\), which when applied to \(y\) yields the transformation

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \]

There are multiple ways to apply the Box-Cox transform in R, but we are going to take advantage of the one in caret to make our lives easier, BoxCoxTrans().

bc_data <- BoxCoxTrans(imputedTrain$PredictedReadmissionRate)
bc_data
## Box-Cox Transformation
## 
## 2266 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.90   15.14   16.27   16.46   17.62   25.19 
## 
## Largest/Smallest: 2.31 
## Sample Skewness: 0.473 
## 
## Estimated Lambda: -0.3

So, we can see here that the estimated lambda is -0.3, which puts our transform between a log transform (\(\lambda = 0\)) and an inverse square-root transform (\(\lambda = -0.5\)).
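To make the role of \(\lambda\) concrete, here is a hand-rolled version of the transform (a sketch; `box_cox` is my own hypothetical helper, and BoxCoxTrans() additionally estimates and rounds \(\lambda\) and handles edge cases for us):

```r
## Box-Cox by hand: (y^lambda - 1) / lambda, or log(y) when lambda = 0
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

## Applied at the estimated lambda = -0.3, e.g. to the median rate (16.27):
box_cox(16.27, -0.3)
```

Because \(\lambda\) is negative, larger raw values get compressed more strongly, which is exactly what pulls in the right tail of our skewed target.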

Now apply the Box-Cox transformation:

imputedTrain$bc_PredictedReadmissionRate <- predict(bc_data, imputedTrain$PredictedReadmissionRate)

Can you tell what it did to the distribution? Look closely, if not. It’s subtle but discernible.

Question 14: [1 point]

Now repeat all of the steps to perform a Box-Cox transform on your own to the imputedTest data. Why do you think we need to transform the test data too?

Your answer here.

# Your code here

# Make sure to check that your transform looks successful!

3.5.2 Centering & Scaling

Now that we have applied a Box-Cox transform to the target variable, we can center and scale the remainder of the variables. Note that we had to perform the Box-Cox prior to centering because the data must be positive for a Box-Cox transformation (unless we switch to a Yeo-Johnson transform). Centering will likely make a lot of the data negative, and we know we purposefully have negatives in our dataset already.

Question 15: [1 point]

Make a copy of the imputed training set, drop our original, non-Box-Cox transformed target from the dataset, and then center and scale the whole training set using the scale() function in R. I called my dataset readyTrain, so if you name it anything else just note you’ll have to update subsequent code. Then, do the same thing for the test data!

# Your code here

## Make sure you both center AND scale your data!

If you’re worried you did it incorrectly, or you got stuck, you may load my dataset to proceed.

Hint: Make sure to turn your centered, scaled matrix back into a dataframe! By default, it creates a matrix.

load(file = "readyTrain.Rdata")
# readyTrain <- readyTrain %>% data.frame()

load(file = "readyTest.Rdata")
# readyTest <- readyTest %>% data.frame()

4 Unsupervised Learning Methods

4.1 Principal Component Analysis

It’s our friend, PCA again! We are going to leverage unsupervised learning methods, namely Principal Component Analysis and \(k\)-means clustering, to help us to further reduce the dimensionality of our dataset. The other advantage of this type of analysis is segmentation: it will allow us to determine if there are broader classifications of our hospitals that we need to report back to the advocacy group.

Let’s apply the PCA and display the scree plot of the first 15 principal components.

pca <- prcomp(readyTrain)
summary(pca)
## Importance of components:
##                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.951 1.76558 1.56187 1.47712 1.30130 1.24633 1.18033
## Proportion of Variance 0.256 0.09168 0.07175 0.06417 0.04981 0.04569 0.04098
## Cumulative Proportion  0.256 0.34773 0.41948 0.48366 0.53346 0.57915 0.62012
##                           PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     1.1411 1.04598 0.98838 0.94663 0.92540 0.90551 0.8746
## Proportion of Variance 0.0383 0.03218 0.02873 0.02636 0.02519 0.02412 0.0225
## Cumulative Proportion  0.6584 0.69060 0.71933 0.74569 0.77088 0.79499 0.8175
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.77922 0.76403 0.75386 0.70783 0.69926 0.66470 0.61676
## Proportion of Variance 0.01786 0.01717 0.01671 0.01474 0.01438 0.01299 0.01119
## Cumulative Proportion  0.83535 0.85252 0.86923 0.88397 0.89835 0.91134 0.92253
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.59823 0.57663 0.52832 0.49980 0.49838 0.48923 0.4664
## Proportion of Variance 0.01053 0.00978 0.00821 0.00735 0.00731 0.00704 0.0064
## Cumulative Proportion  0.93306 0.94284 0.95105 0.95839 0.96570 0.97274 0.9791
##                           PC29    PC30    PC31    PC32    PC33    PC34
## Standard deviation     0.41727 0.40916 0.38864 0.31460 0.30263 0.16190
## Proportion of Variance 0.00512 0.00492 0.00444 0.00291 0.00269 0.00077
## Cumulative Proportion  0.98426 0.98918 0.99362 0.99654 0.99923 1.00000

Let’s also look at this as a Scree Plot.
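One way to produce such a plot is with base R's screeplot() (factoextra::fviz_eig() is another popular option); a sketch:

```r
## Scree plot of the first 15 principal components
screeplot(pca, npcs = 15, type = "lines",
          main = "Scree Plot of the First 15 PCs")
```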

Question 16: [1 point]

Using the ‘elbow’ method of the scree plot - the point at which the percentage of variance explained seems to level off - approximately how many principal components would be significant to explain the total variation in the training data set? (Hint: if it’s not super obvious, that’s actually informative!)

Your answer here

Next, let’s explore how much the variables are contributing (this is based on the weight of their rotations). In PCA, we are interested in the loadings, i.e., the linear combination weights (coefficients) whereby unit-scaled components define or “load” a variable. Loadings help us interpret principal components.

You may have noticed we used the prcomp() function above; prcomp() returns the loadings in the variable $rotation, which contains a matrix of variable loadings. These loadings are determined from the eigenvectors, and without getting into what those are (because we’d have to take a departure into linear algebra), suffice it to say that they measure the direction and magnitude of each principal component as we account for the variance explained.
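For example, we can peek at which variables carry the largest (absolute) loadings on PC1; a sketch:

```r
## Top 10 variables by absolute loading on the first principal component
head(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE), 10)
```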

Additionally, we are going to color this by our group variable in the dataset, ComparedToNational_Hospital.return.days.for.pneumonia.patients. Recall from our earlier exploration that this variable shows each hospital’s performance, relative to the national average, in terms of its pneumonia-related return-to-hospital days, where “Worse” means worse than the national average, i.e., that hospital has a lot of patients returning with pneumonia and/or for a large number of days.
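A sketch of how such a colored scores plot could be built (this assumes ggplot2 is loaded and that imputedTrain still holds the un-scaled version of the grouping variable, in the same row order as readyTrain):

```r
library(ggplot2)

## PC1 vs PC2 scores, colored by the national comparison variable
data.frame(pca$x[, 1:2],
           Group = imputedTrain$ComparedToNational_Hospital.return.days.for.pneumonia.patients) %>%
  ggplot(aes(x = PC1, y = PC2, color = Group)) +
  geom_point(alpha = 0.5) +
  theme_minimal()
```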

Question 17: [1 point]

On principal components 1 and 2 (the PCs that account for the most variance in the training data), do you see any pattern (i.e., clustering) relative to the hospital return days for pneumonia patients compared to the national average? Why or why not?

Your answer here

Next, let’s look at the contributions of the top 20 features; anything above the dotted red-line contributes significantly to the overall explanation of variance, and anything below the dotted line does not.

Notice that ALL of the top variables are based on patient survey ratings! We also see our target, PredictedReadmissionRate, among the significant variables.

4.2 k-means Clustering

\(k\)-means is a centroid-based clustering algorithm, where a “centroid” is the geometric center of a cluster, typically measured by Euclidean distance. The algorithm initializes the centroids by randomly selecting data points; it then computes the distance between each data point and each centroid and assigns the point to the cluster with the nearest centroid. The goal is to identify the optimal number of groups, \(k\), in a dataset by minimizing the distance of each data point to its respective centroid.

Although multiple methods exist to determine the optimal number of clusters, one commonly employed approach is heuristic, i.e., it isn’t really computational. You either choose the optimal number of clusters graphically from a set of candidate values of \(k\), or you can once again employ the ‘elbow’ method. Let’s start by visually inspecting the effect of different \(k\) clusters on PC1 and PC2 of the dataset:

Question 18: [1 point]

Which number of clusters, \(k\), do you think has the best explanatory power? Is it hard to tell?

Your answer here

Next, let’s employ the ‘elbow’ method again, this time looking for where the within-cluster sum of squares drops off, rather than the percent variance explained as was done in PCA.
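The within-cluster sum of squares curve can be computed with a quick loop; a sketch:

```r
## Total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(readyTrain, centers = k, nstart = 25, iter.max = 1000)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```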

Question 19: [1 point]

What is the within sum of squares and what exactly is it measuring here? By the elbow method, which \(k\) is the optimal number of clusters? Did it agree with your choice from the other heuristic method?

Your answer here

4.3 Segmentation Analysis

As we said before, the goal of segmentation analysis is to broadly cluster our hospitals by their features to help us characterize them.

Question 20: [1 point]

Let’s explore the clusters - segments - of the hospitals based on their average values from the original dataset. Your task is to add comments to this code chunk.

kmeansResult <- kmeans(readyTrain,  
                         centers = 3,      ## COMMENT HERE
                         nstart = 50,      
                         iter.max = 1000,  
                         algorithm = "MacQueen")

## COMMENT HERE
kmeansTrain <- imputedTrain %>% 
  mutate(Cluster = kmeansResult$cluster) %>% 
  clean_names()

## COMMENT HERE (Hint: look at the PCA Importance Graph again if you aren't sure!)
pcaImportantVars <- c("overall_hospital_rating",
                      "care_transition",
                      "nurse_communication",
                      "recommend_hospital",
                      "staff_responsiveness",
                      "communication_about_medicines",
                      "doctor_communication",
                      "discharge_information",
                      "predicted_readmission_rate",
                      "quietness",
                      "median_time_minutes_patients_spent_in_ed",
                      "hospital_return_days_for_pneumonia_patients",
                      "emergency_department_volume", 
                      "compared_to_national_hospital_return_days_for_pneumonia_patients",
                      "cleanliness")

## COMMENT HERE
kmeansTrain %>% 
  ## COMMENT HERE
  select(all_of(pcaImportantVars), cluster) %>% 
  ## COMMENT HERE
  rename(`Hospital Return Days Score` = hospital_return_days_for_pneumonia_patients,
         `Hospital Return Days Nat'l Comparison` = compared_to_national_hospital_return_days_for_pneumonia_patients,
         `Median Minutes spent in Emergency Dept` = median_time_minutes_patients_spent_in_ed) %>% 
  ## COMMENT HERE
  clean_names(case = "title") %>% 
  group_by(Cluster) %>% 
  ## COMMENT HERE
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  ## COMMENT HERE
  pivot_longer(-1, names_to = "Measure", values_to = "Value") %>% 
  ## COMMENT HERE
    ggplot(aes(x = Cluster, y = Value, fill = as.factor(Cluster))) +
    geom_col(alpha = 0.75) +
    scale_fill_manual(values = myPal) +
    labs(title = "Cluster Means",
         x = "",
         fill = "Cluster") +
    theme_minimal() +
    theme(legend.position = "bottom",
        strip.text = element_text(size=6),
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10),
        title = element_text(size = 12, color = "maroon")) + 
    facet_wrap(~Measure, 
             scales = "free_y",
             ncol = 3)

Question 21: [1 point]

Notice, for example, that Cluster 1 hospitals seem to include the hospitals that were below the national average in return hospital days (and thus had the fewest return hospital days in terms of the quantitative score). Look at how else you might broadly characterize - or segment - these hospitals.

Your job is to make a table for our client, summarizing each of the three clusters. Give them an informative Title - e.g., “Highest Performing Hospitals” for cluster 1, perhaps?

Then come up with at least FOUR characteristics per cluster. You may choose to focus on different attributes for different clusters. Fill in the code for the table below. I start you out with two examples for Cluster 1, but feel free to add more! You are also welcome to rename Cluster 1 if you think of a catchier name than I chose!

Table 5. Hospital segmentation analysis: three types of broad groupings identified.
Top Defining Attributes Description
1: Highest Performing Hospitals
Return hospital days due to pneumonia Below national average and lowest return days overall
Emergency department (ED) Lowest median time for patients in the ED, and lowest ED volume
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
2: NAME ME!
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
3: NAME ME!
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN
FILL ME IN FILL ME IN

4.4 Final Remarks on Segmentation

We don’t technically have to stop our segmentation analysis here, but we will so that we can focus on other analyses. One thing we might choose to do in the future, for example, is to choose a classification machine learning algorithm to test the predictions based on these three clusters.

Question 22: [1 point]

Make a recommendation for our client for 1-2 machine learning analyses your team would choose to use to test whether these clusters can be used for robust predictions of hospital performance for pneumonia patients. How would you know whether or not your predictions were robust? What would you look for or compare?

Your answer here.

5 Supervised Learning Methods

We are going to perform two supervised learning methods: an Ordinary Least Squares (OLS) regression followed by an Elastic Net. However, before we do our OLS regression, we need to assess multicollinearity. Recall from lecture that multicollinearity is when one or more of our predictors are moderately to highly correlated with each other. This inflates the variance of our coefficient estimates, making them unstable and their p-values unreliable, which can produce spurious associations.

Question 23: [1 point]

Could our PCA benefit from re-running it after dealing with possible multicollinearity? Explain your reasoning.

Your answer here.

5.1 Multicollinearity Assessment

There are two primary ways we assess multicollinearity, which we will implement here. The first is pairwise correlation, usually through a Pearson or Spearman correlation, and the other is by checking the variance inflation attributable to the features. Although it is not always necessary to check for multicollinearity, it can be helpful for understanding the underlying ‘landscape’ of your features and their relationships to both the target and to each other. For the types of analyses we plan to undertake, it’s a very necessary assessment to run.

5.1.1 Pairwise Correlation

Let’s make a correlation matrix and look at the correlation of the variables within the training dataset to assess possible areas of multicollinearity.
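One way to do this is with the corrplot package (a sketch; this assumes readyTrain is entirely numeric after centering and scaling):

```r
library(corrplot)

## Pearson correlation matrix, clustered so correlated blocks sit together
corMat <- cor(readyTrain)
corrplot(corMat, method = "color", order = "hclust",
         tl.cex = 0.4, tl.col = "black")
```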

Question 24: [1 point]

Where do you see the highest potential for multicollinearity? Why do you say that?

Your answer here.

5.2 OLS Multiple Linear Regression

5.2.1 Fitting a model to estimate VIF

Another way we can assess multicollinearity is by fitting a multiple linear regression and estimating the variance inflation factors, or VIF, which estimates how much the variance of an estimated regression coefficient increases if your predictors are correlated. Specifically, variance inflation can identify multicollinearity when the VIF is greater than 4 and especially when it is above 10.

Fit a linear regression with the training data, using the Box-Cox transformed Predicted Readmission Rate as the target. The four plots are diagnostic plots for the primary assumptions of a linear regression: linearity between the outcome and the predictors, normally distributed residuals, homoscedasticity (i.e., equality of variances in the residuals), and no outliers or high leverage/influence points.
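A sketch of what the fit and the four diagnostic plots could look like (the model object is called mod here to match the later code):

```r
## Fit the OLS regression on the Box-Cox transformed target
mod <- lm(bc_PredictedReadmissionRate ~ ., data = readyTrain)

## The four standard diagnostic plots, in a 2x2 grid
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))
```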

Question 25: [1 point]

You may have to do a little outside research, but interpret the four plots that have come off the regression. Do we largely meet the assumptions for an OLS multiple regression? Why or why not?

Your answer here.

Note: If you are not entirely sure about the Q-Q plot, you can run a Shapiro-Wilk test on the residuals using the following code:

shapiro.test(residuals(mod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod)
## W = 0.99204, p-value = 0.0000000008371

5.2.2 Estimating the VIF (Variance Inflation Factors)

Next, let’s calculate the VIFs using the vif() function that is part of the car package. Note that I am not printing all of the VIFs, just the top 15 after sorting in descending order.
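The calculation itself is essentially a one-liner; a sketch:

```r
library(car)

## VIFs from the fitted model, sorted largest first; show the top 15
head(sort(vif(mod), decreasing = TRUE), 15)
```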

Table 6. Variance Inflation Factors after multiple linear regression
VIF
Overall.hospital.rating 24.65
Recommend.hospital 16.39
Nurse.communication 8.57
Care.transition 8.32
Staff.responsiveness 5.65
Communication.about.medicines 4.79
Doctor.communication 4.28
Discharge.information 3.58
Hospital.return.days.for.pneumonia.patients 3.57
Intensive.Care.Unit.Venous.Thromboembolism.Prophylaxis 3.37
ComparedToNational_Hospital.return.days.for.pneumonia.patients 3.27
Venous.Thromboembolism.Prophylaxis 3.19
Composite.patient.safety 2.73
Quietness 2.37
Median.time..minutes..patients.spent.in.ED 1.97
Question 26: [1 point]

Does multicollinearity seem to be an issue in this dataset and, if so, among which variables? Why do you come to this conclusion?

Your answer here.

5.2.3 Dropping the most collinear variables and re-running the OLS regression.

Even though we plan to proceed with an elastic net (which is much more robust to multicollinearity), to finish our OLS regression we will need to drop the most collinear variables. Note that I am making a copy of readyTrain first, called regressionTrain. We will do the same for readyTest also.

## Drop those with VIF over 4
regressionTrain <- readyTrain %>% 
  select(-Overall.hospital.rating, -Recommend.hospital, -Nurse.communication, -Care.transition, -Staff.responsiveness, -Communication.about.medicines)

## Do the same for the testing data!
regressionTest <- readyTest %>% 
  select(-Overall.hospital.rating, -Recommend.hospital, -Nurse.communication, -Care.transition, -Staff.responsiveness, -Communication.about.medicines)

Now, re-run the linear regression and let’s print the adjusted \(R^2\).

## [1] 0.6098681

It is surprisingly high for how many features it has (although it isn’t stellar, granted)!

Use the step() function to do a backward regression to find our most parsimonious model. Stepwise regression drops terms that are not significant or important, thereby keeping only the features that contribute to the overall explanation of variance in predicted readmission rate.

## Perform a backward, stepwise regression to find the most parsimonious model
bestMod <- step(mod, 
            direction = "backward", 
            trace = FALSE)

Lastly, let’s take a peek at the results. We will use a package called stargazer to help us visualize the results in a nicely organized way.

## Print a table using stargazer
stargazer(bestMod, 
          type = "html",
          title = "Table 7. Parsimonious OLS Regression Results.")
Table 7. Parsimonious OLS Regression Results.
Dependent variable:
bc_PredictedReadmissionRate
MRSA.Bacteremia 0.048***
(0.014)
Payment.for.pneumonia.patients 0.224***
(0.018)
Abdomen.CT.Use.of.Contrast.Material 0.034**
(0.014)
Death.rate.for.pneumonia.patients -0.126***
(0.014)
Postoperative.respiratory.failure.rate 0.020
(0.013)
Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.020
(0.013)
Medicare.spending.per.patient 0.040**
(0.017)
Median.time..minutes..patients.spent.in.ED 0.112***
(0.015)
Intensive.Care.Unit.Venous.Thromboembolism.Prophylaxis 0.060***
(0.015)
Hospital.return.days.for.pneumonia.patients 0.113***
(0.024)
Doctor.communication -0.077***
(0.021)
Discharge.information 0.032
(0.020)
Cleanliness -0.038**
(0.016)
Quietness -0.064***
(0.018)
ExcessReadmissionRatio 0.469***
(0.017)
SurveyResponseRate -0.024
(0.015)
Emergency.department.volume 0.095***
(0.016)
ComparedToNational_Hospital.return.days.for.pneumonia.patients 0.036
(0.024)
State 0.033**
(0.014)
Constant 0.000
(0.013)
Observations 2,266
R2 0.614
Adjusted R2 0.610
Residual Std. Error 0.624 (df = 2246)
F Statistic 187.768*** (df = 19; 2246)
Note: p<0.1; p<0.05; p<0.01

Lastly, let’s assess how good a predictive model our OLS regression model is. We will use the predict() function to apply our most parsimonious model bestMod to the test data:

## Make the predictions
predictions <- predict(bestMod, newdata = regressionTest)
## Calculate the RMSE
RMSE(predictions, regressionTest$bc_PredictedReadmissionRate)
## [1] 0.6631593
Question 27: [1 point]

Can you interpret what the RMSE tells your client here? Is this a good model? An okay model?

Hint 1: Remember that the outcome here is Box-Cox transformed. Would you need to undo the transformation to be able to assess what the RMSE is actually telling us? Recall that the Box-Cox transform here was \(\hat{y} = y^{\lambda} = \frac{1}{y^{0.3}}\), where our \(\lambda = -0.3\), and we also centered and scaled these data.

Hint 2: Is it practical to undo both transformations? How does that hinder our ability to make an interpretation for our client?

Your answer here.

Lastly, let’s graph the predictive ability of our OLS model by comparing the relationship between the true values of bc_PredictedReadmissionRate from the regressionTest dataset and the predictions we made using the model on the testing data.

5.3 The Elastic Net

A way to extend OLS regression is with the regularized regressions, such as Ridge, LASSO, and Elastic Net, which are more robust to multicollinearity because they apply penalties that shrink the coefficients. Typically, we would likely conduct a Ridge and/or LASSO before running an Elastic Net, but since we already know that (1) we have a fairly high amount of multicollinearity in our original readyTrain dataset and (2) we want to be able to perform feature reduction and figure out which features are important predictors of hospital readmissions, we’re going to jump straight into the Elastic Net, which can do both.

5.3.1 The Elastic Net Unpacked

In general, regularization is a technique that applies a penalty term to a cost function of a machine learning model, like the OLS regression. The reason is to discourage overfitting. This penalty term constrains the model’s coefficients, which limits how flexible they are during training. Why?

The more flexible the coefficients, the more likely they are to “memorize” the training data rather than learn generalizable patterns. Thus, by applying this penalty (constraint), we improve the model’s performance on “unseen” or new data.

As I’ve alluded, Elastic Net regression is a general regularization technique that combines the regularization techniques of Ridge and LASSO to help us perform both feature selection (zeroing out unimportant coefficients) and feature reduction (shrinkage). The elastic net can even help us with the \(P \gg n\) problem we dealt with in gene expression data, so we could have applied this algorithm there as well!

Elastic net has two regularization terms, called L1 (or the Lasso regularization term) and L2 (the Ridge term).

  • L1: makes some of the coefficients zero, thereby selecting only the most important features (shrinks them to zero). The term often used is that this term encourages sparsity in the model.

  • L2: reduces the coefficients to small but non-zero values, thereby reducing the coefficients of the unimportant features (which we can use to gauge importance). It does this by adding a penalty term, \(\lambda \sum_{j} \beta_j^2\), to the cost function of the model, which is proportional to the sum of the squared coefficients. This allows us to retain all the features in the model but reduces the overall impact of the non-significant or less-important ones.

Thus, by combining these techniques together, the Elastic Net is balancing feature selection with feature reduction! If you’re interested in reading the original paper on Elastic Net by Zou and Hastie (2005), you can find a free PDF of the paper here.
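Putting the two penalties together, the elastic net solves (in glmnet’s parameterization, for a Gaussian outcome):

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2
  + \lambda \left( \alpha \sum_{j} |\beta_j| + \frac{1-\alpha}{2} \sum_{j} \beta_j^2 \right)
```

Here \(\alpha\) mixes the L1 (LASSO) and L2 (Ridge) penalties and \(\lambda\) controls the overall penalty strength - exactly the pair of hyperparameters we tune.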

5.3.2 Running, tuning, & cross-validating the Elastic Net

The elastic net can be run using the glmnet package (which we are accessing via caret), and it can be cross-validated just as the random forest and SVM from our last project were. This is a big improvement over OLS regression, which cannot be tuned. Additionally, the elastic net has two hyperparameters that we can tune, \(\alpha\) and \(\lambda\) (not to be confused with the \(\lambda\) from the Box-Cox transformation - Greek letters are abundant in statistics!).

  • \(\alpha\) is the mixing parameter, representing the balance between L1 and L2 regularization. Larger \(\alpha\) results in more L1 regularization (feature selection / LASSO), whereas smaller \(\alpha\) results in more L2 regularization (higher shrinkage / feature reduction / Ridge). When \(\alpha = 1\) the model is a pure LASSO, and when \(\alpha = 0\) it is a pure Ridge! \(\alpha\) is also referred to as the mixing percent of L1 and L2 regularization.

  • \(\lambda\) is the shrinkage parameter, such that when \(\lambda = 0\) no shrinkage is performed and the model is equivalent to an OLS regression. As \(\lambda\) increases, the coefficients are increasingly shrunk toward zero (and, when \(\alpha > 0\), more of them are shrunk exactly to zero).

We will start by setting our CV parameters just as we did on the last project, using the trainControl() function in caret.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 5, 
                     search = "grid", 
                     verboseIter = FALSE)

Then, we will fit the model using the train() function in caret, specifying that we want to perform an elastic net, which will tune the model and perform CV per the specifications in the trainControl().

searchGrid <- expand.grid(.alpha = seq(0, 1, length.out = 10), .lambda = seq(0, 5, length.out = 15))
elasticMod <- train(bc_PredictedReadmissionRate ~ ., 
                    data = readyTrain[, -26], 
                    method = "glmnet", 
                    tuneGrid = searchGrid,
                    trControl = ctrl) 
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

The warning we are getting is the result of a resampling issue with column 4, Death.rate.for.pneumonia.patients. There are two possible reasons why we might get this warning, the obvious one being that we forgot to drop NAs from the data. However, we can see that there are no missing values, so that is not the issue:

sum(colSums(is.na(readyTrain)))
## [1] 0

The other reason would be failed convergence of the model. This suggests that our model failed to converge during cross-validation, likely because in some of the folds the predictions have zero variance.

Question 28: [1 point]

Explain what exactly we are doing here. Remind yourself of the parameters of the trainControl() and train() functions, and look up any new ones you do not yet know. Specifically, make sure to address what type of CV and hyperparameter search is being performed, as well as if both the \(\alpha\) and \(\lambda\) hyperparameters of elastic net are being tuned.

Your answer here.

5.3.3 Elastic Net Results

We are going to first take a look at a selection of the results of the Elastic Net in table form.
Table 8. Results of the Elastic Net Tuning & Repeated 10-fold Cross-Validation
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.00 0.00 0.72 0.48 0.57 0.03 0.04 0.03
0.00 0.36 0.73 0.47 0.58 0.04 0.04 0.03
0.00 0.71 0.74 0.46 0.59 0.04 0.04 0.03
0.00 1.07 0.75 0.46 0.60 0.04 0.04 0.03
0.00 1.43 0.76 0.45 0.61 0.04 0.04 0.03
0.00 1.79 0.77 0.44 0.61 0.04 0.04 0.03
0.00 2.14 0.78 0.44 0.62 0.04 0.04 0.03
0.00 2.50 0.79 0.43 0.63 0.04 0.04 0.03
0.00 2.86 0.80 0.43 0.63 0.04 0.04 0.03
0.00 3.21 0.80 0.42 0.64 0.04 0.04 0.03
0.00 3.57 0.81 0.42 0.64 0.04 0.04 0.03
0.00 3.93 0.81 0.42 0.65 0.04 0.04 0.03
0.00 4.29 0.82 0.41 0.65 0.04 0.04 0.03
0.00 4.64 0.83 0.41 0.66 0.04 0.04 0.03
0.00 5.00 0.83 0.41 0.66 0.04 0.04 0.03
0.11 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.11 0.36 0.74 0.47 0.58 0.04 0.04 0.03
0.11 0.71 0.77 0.46 0.61 0.04 0.04 0.03
0.22 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.22 0.36 0.75 0.46 0.60 0.04 0.04 0.03
0.33 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.33 1.43 0.98 0.33 0.78 0.04 0.05 0.03
0.44 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.44 1.07 0.98 0.32 0.78 0.04 0.05 0.03
0.56 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.67 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.67 0.71 0.97 0.31 0.77 0.04 0.05 0.03
0.78 0.00 0.72 0.48 0.57 0.04 0.04 0.03
0.89 0.00 0.72 0.48 0.57 0.04 0.04 0.03
1.00 0.00 0.72 0.48 0.57 0.04 0.04 0.03

Hmm, I am not seeing that any of the folds failed to converge (\(R^2\) = NA), although that is still the only other cause for the issue.

What were the hyperparameters of our best tuned model?
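In caret, the winning hyperparameter combination is stored in the fitted model object, e.g.:

```r
## Best (alpha, lambda) pair found during the grid search
elasticMod$bestTune
```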

\(\alpha\):

## [1] 1

\(\lambda\):

## [1] 0
Question 28: [1 point]

Check the values of \(\alpha\) and \(\lambda\) here. What do they suggest about the optimal regularization being fit here? (Recall that when \(\lambda = 0\), no shrinkage is applied and the fit is equivalent to an OLS regression, regardless of \(\alpha\)!)

Your answer here.

Let’s also plot the results of the tuning search. Can you tell which type of tuning search you did? (Hint: the visual pattern should confirm your answer from earlier!)

5.3.4 Important Features

Important features can also be extracted, using the varImp() function from caret.
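For example, a sketch:

```r
## Extract and plot the 15 most important features
importance <- varImp(elasticMod)
plot(importance, top = 15)
```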

Question 29: [1 point]

Which of the features were most important? How does this compare to the most important features out of the regression? How about out of the PCA?

Your answer here.

Let’s also make predictions on the testing sets for the OLS and Elastic Net models and evaluate them with the postResample() function in caret; given predictions and observed values, it generates metrics that we can use to compare model performance, such as RMSE, \(R^2\), and MAE.

prOLS <- postResample(pred = predict(mod, newdata = regressionTest), 
                      obs = regressionTest$bc_PredictedReadmissionRate)

prElastic <- postResample(pred = predict(elasticMod, newdata = readyTest), 
                      obs = readyTest$bc_PredictedReadmissionRate)

## Display the output
rbind(prOLS, prElastic) %>% 
  kable(digits = 2,
    format = "html",
    caption = "Table 9. Comparison of the OLS and Elastic Net Regressions") %>%
  kable_styling(bootstrap_options = c("hover"),
                full_width = F)
Table 9. Comparison of the OLS and Elastic Net Regressions
RMSE Rsquared MAE
prOLS 0.66 0.56 0.52
prElastic 0.77 0.41 0.61
Question 30: [1 point]

Using the same method, check for overfitting / underfitting. Make predictions using readyTrain rather than readyTest, and then check how the RMSE on the training data compares to the RMSE on the testing data for the Elastic Net. How do they compare? (Recall what we discussed when we did the gene expression demo: the relationship of the test vs. training accuracy can be used to assess overfitting/underfitting. The same logic applies to RMSE too!) How do both compare to what we got out of the OLS regression? Which would you say is the better predictive model, the OLS regression or the Elastic Net? Why?

Your answer here.

# Your code here
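If you get stuck, a sketch of the training-set version (mirroring the test-set call above):

```r
## Same metrics, but computed on the data the model was trained on
postResample(pred = predict(elasticMod, newdata = readyTrain),
             obs = readyTrain$bc_PredictedReadmissionRate)
```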

A graphical comparison of the two models.

Unsurprisingly from the RMSE, these models are performing VERY similarly.

Question 31: [4 points]

What conclusions do you make at this point about whether to make predictions with an OLS regression, Elastic Net, or neither? What recommendations would you make to our client at this point? What other analyses might you suggest, or other data we might want to include? Please try to flesh out 2-3 recommendations in at least a paragraph.

Your answer here.

6 Next Steps

It’s time to choose your adventure! After working through this demo and answering the 31 questions, you will choose one of three possible trajectories. The following are some brief hints or tips for each of the choices available to you.

6.1 Choose your Adventure

Adventure 1. [Choosing one other condition to analyze.]

  • Make sure that you use the code I provided, updating it for your new condition
  • Make sure to answer all of the interpretative questions again for your new dataset.
  • I will specifically be looking for a comparison of which condition, pneumonia or the one you chose, seems to be a better choice for a predictive model for our client. Or is it neither?
  • Make sure to include some next steps or recommendations based on your new analysis!

Adventure 2. [Choosing two or more conditions to analyze.]

  • Choose a slightly more complicated analysis to undertake, for example, focusing on surgical interventions (HIP-KNEE & CABG) or heart-related conditions (HF, AMI, & CABG).
  • See the same recommendations and hints as above.

Adventure 3. [Continue analyzing the pneumonia data.]

  • Make sure to justify your choice.
  • Choose at least two new algorithms to perform. They can be unsupervised or supervised, but must be 2+.
  • A natural extension would be to do a Ridge and a LASSO. Another option to consider might be a regression-based random forest, since we’ve already done something similar, but you are free to choose anything you think is appropriate here!
  • Other choices I think you could easily justify: (1) classification-based random forest using ComparedToNational_Hospital.return.days.for.pneumonia.patients, (2) kNN using the three clusters we’ve identified and named in our segmentation analysis, or (3) an SVM, as the data are already prepared for that as well. And, although they may still be beyond your skillset, our data are also ready for a neural network analysis (e.g., a feed-forward NN)!

6.2 Deliverables

Unlike what you may have seen on Canvas, I am going to make things much lighter on you this time so we can all catch a breath. I am looking for 1-2 markdown documents with their knitted HTML. For example, if you’re choosing Adventures 1 or 2, you may want to work through this document once, make a copy, and then do it again for the new condition(s) you choose to work through. If you’re choosing Adventure 3, maybe you just choose to add on to the bottom of this document, replacing the ‘Next Steps’ section you see here.

Just make sure to answer the questions well, and make sure to justify the decisions you make. Tell me WHY you’re choosing the condition(s) you are in Adventures 1 or 2. Tell me WHY you’re doing the analyses you are in Adventure 3. Other than that, make this your own exploratory learning adventure!